ENH: Optimize nrows in read_excel #35974

MarcoGorelli · 2020-08-29T10:15:23Z

closes read_excel opimize nrows #32727
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

based on #33281

output of asv benchmarks:

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev/asv_bench$ asv continuous -f 1.1 upstream/master optimise-nrows-excel -b excel.ReadExcel
· Creating environments..................................................................................................................................
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building d0a8a687 <optimise-nrows-excel> for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
·· Installing d0a8a687 <optimise-nrows-excel> into conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit c413df6d <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 12.50%] ··· Running (io.excel.ReadExcel.time_read_excel--)..
[ 25.00%] · For pandas commit d0a8a687 <optimise-nrows-excel> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 37.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 37.50%] ··· Running (io.excel.ReadExcel.time_read_excel--)..
[ 50.00%] · For pandas commit d0a8a687 <optimise-nrows-excel> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 62.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                                   ok
[ 62.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      953±6ms   
               openpyxl   1.66±0.03s 
                 odf      6.02±0.02s 
              ========== ============

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                             ok
[ 75.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      878±20ms  
               openpyxl   1.67±0.02s 
                 odf      4.58±0.04s 
              ========== ============

[ 75.00%] · For pandas commit c413df6d <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 87.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                                   ok
[ 87.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      941±5ms   
               openpyxl   1.69±0.02s 
                 odf      6.15±0.04s 
              ========== ============

[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                             ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      971±20ms  
               openpyxl   1.69±0.01s 
                 odf      6.07±0.03s 
              ========== ============

       before           after         ratio
     [c413df6d]       [d0a8a687]
     <master>         <optimise-nrows-excel>
-        971±20ms         878±20ms     0.90  io.excel.ReadExcel.time_read_excel_nrows('xlrd')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

WillAyd · 2020-09-02T21:05:34Z

pandas/io/excel/_base.py

@@ -453,7 +491,20 @@ def parse(
            else:  # assume an integer if not a string
                sheet = self.get_sheet_by_index(asheetname)

-            data = self.get_sheet_data(sheet, convert_float)
+            get_sheet_data_header = 0 if header is None else header


Is this required? Looks like there are already checks within the subsequent functions for int values, no?

WillAyd · 2020-09-02T21:05:57Z

pandas/io/excel/_xlrd.py

+
+        for i in range(sheet_nrows):
+            if self.should_skip_row(i, header, skiprows, nrows):
+                data.append([])


Do we need to append the empty list here? It would be preferable to just continue.

I removed should_skip_row entirely. I think the benefit from the original PR came from not reading all the rows into memory, rather than from optimising how rows are skipped.
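
A minimal sketch of that idea, not the code from this PR: when the reader pulls only the rows it needs from a lazy row source, the remaining rows are never materialised. The names `counting_rows` and `read_limited_rows` are hypothetical, and the sketch assumes `header` is a single integer or None.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def counting_rows(n_rows: int, consumed: List[int]) -> Iterator[List[int]]:
    """Hypothetical lazy sheet: yields rows one at a time and records each read."""
    for i in range(n_rows):
        consumed.append(i)
        yield [i, i * 2]


def read_limited_rows(
    rows: Iterable[List[int]], header: Optional[int], nrows: Optional[int]
) -> List[List[int]]:
    """Read only the header rows plus `nrows` data rows; read everything if nrows is None."""
    if nrows is None:
        return list(rows)
    # Rows 0..header are needed to reach the header row (assumption: int header).
    header_rows = 0 if header is None else header + 1
    return list(islice(rows, header_rows + nrows))


consumed: List[int] = []
data = read_limited_rows(counting_rows(1_000_000, consumed), header=0, nrows=3)
print(len(data), len(consumed))  # prints "4 4": one header row plus three data rows
```

Only four of the one million rows are ever produced, which is where the saving would come from in a reader that can iterate lazily.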

MarcoGorelli · 2020-09-04T15:42:49Z

Performance benchmarks:

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev/asv_bench$ asv continuous -f 1.1 upstream/master optimise-nrows-excel -b excel.ReadExcel
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 2d9ee8de <optimise-nrows-excel> for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.................................
·· Installing 2d9ee8de <optimise-nrows-excel> into conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit b53dc8f8 <optimise-nrows-excel^2> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.....................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 12.50%] ··· Running (io.excel.ReadExcel.time_read_excel--).
[ 25.00%] ··· Running (io.excel.ReadExcel.time_read_excel_nrows--).
[ 25.00%] · For pandas commit 2d9ee8de <optimise-nrows-excel> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 37.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 37.50%] ··· Running (io.excel.ReadExcel.time_read_excel--).
[ 50.00%] ··· Running (io.excel.ReadExcel.time_read_excel_nrows--).
[ 50.00%] · For pandas commit 2d9ee8de <optimise-nrows-excel> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 62.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                        ok
[ 62.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     1.01±0.02s 
               openpyxl   1.85±0.02s 
                 odf      6.04±0.03s 
              ========== ============

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                  ok
[ 75.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      877±8ms   
               openpyxl   1.83±0.01s 
                 odf      4.57±0.04s 
              ========== ============

[ 75.00%] · For pandas commit b53dc8f8 <optimise-nrows-excel^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 87.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                        ok
[ 87.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      974±20ms  
               openpyxl   1.82±0.04s 
                 odf      6.21±0.1s  
              ========== ============

[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                  ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      960±9ms   
               openpyxl    1.79±0s   
                 odf      5.87±0.01s 
              ========== ============

       before           after         ratio
     [b53dc8f8]       [2d9ee8de]
     <optimise-nrows-excel^2>       <optimise-nrows-excel>
-      5.87±0.01s       4.57±0.04s     0.78  io.excel.ReadExcel.time_read_excel_nrows('odf')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
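
For reading the summary table above, here is a small sketch of how the ratio column relates to the timings. It assumes the ratio is after/before and that a result is listed when it falls outside the `-f 1.1` factor in either direction; asv also applies a statistical check that this sketch ignores. The function names are hypothetical.

```python
def asv_ratio(before_s: float, after_s: float) -> float:
    """Ratio as shown in the summary table: after / before (below 1 means faster)."""
    return after_s / before_s


def is_flagged(ratio: float, factor: float = 1.1) -> bool:
    """Assumed threshold rule: outside [1 / factor, factor] counts as a change."""
    return ratio > factor or ratio < 1 / factor


# time_read_excel_nrows timings in seconds (before, after), taken from the run above
timings = {"xlrd": (0.960, 0.877), "openpyxl": (1.79, 1.83), "odf": (5.87, 4.57)}
for engine, (before, after) in timings.items():
    r = asv_ratio(before, after)
    print(f"{engine}: ratio={r:.2f} flagged={is_flagged(r)}")
```

Under these assumptions only odf (ratio 0.78) crosses the threshold, which matches the single line reported in the summary.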

The OP mentioned a memory leak from skipping rows in openpyxl, which I experienced too, so I've skipped optimising openpyxl files for now.

jreback · 2020-09-13T20:43:25Z

looks good, i thought this would make a much bigger difference, but ok.

cc @WillAyd

WillAyd · 2020-09-21T21:47:21Z

Thanks @MarcoGorelli

jbrockmendel · 2020-09-22T00:01:09Z

pandas/io/excel/_base.py

+            elif skiprows is None:
+                skiprows_nrows = 0
+            else:
+                skiprows_nrows = len(skiprows)


this looks like it's causing failures on master

yep: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=43000&view=logs&j=b1c7b65e-b3ce-541a-7fd5-29b4ba56ce18&t=46ecc253-e38f-5abc-2ea7-addca6b44d0a

@jbrockmendel can you revert for now and we can reopen
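
The thread does not say which input triggered the failures. Purely as an illustration of the kind of bound the diff above computes, here is a sketch of an upper limit on the raw rows a reader would need. The function `max_rows_needed` is hypothetical and is not the PR's code. It assumes `skiprows` may be None, an int, a list-like, or a callable (the forms `read_excel` documents), and it gives up on early stopping for a callable because `len()` is not defined for it.

```python
from typing import Callable, Optional, Sequence, Union

Header = Union[None, int, Sequence[int]]
SkipRows = Union[None, int, Sequence[int], Callable[[int], bool]]


def max_rows_needed(
    header: Header, skiprows: SkipRows, nrows: Optional[int]
) -> Optional[int]:
    """Upper bound on raw rows to read, or None when no bound can be computed."""
    if nrows is None or callable(skiprows):
        # No limit requested, or skipped rows cannot be counted up front.
        return None

    if header is None:
        header_nrows = 0
    elif isinstance(header, int):
        header_nrows = header + 1
    else:
        header_nrows = max(header) + 1  # multi-row header: need up to the last one

    if skiprows is None:
        skiprows_nrows = 0
    elif isinstance(skiprows, int):
        skiprows_nrows = skiprows
    else:
        # Each listed row costs at most one extra raw row, so len() is a safe bound.
        skiprows_nrows = len(skiprows)

    return header_nrows + skiprows_nrows + nrows
```

Returning None signals the caller to fall back to reading the whole sheet, which trades speed for correctness in the cases the bound cannot cover.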

This reverts commit e975f3d.

jreback · 2020-09-22T01:50:52Z

@MarcoGorelli we are reverting this as something is broken in the evaluation here. Please resubmit when you can.

MarcoGorelli · 2020-09-22T06:53:31Z

Sure - I'm really sorry for the breakage caused, but glad this was caught early!

mproszewska added 30 commits March 27, 2020 19:56

ENH: Skip rows while reading excel file with engine=openpyxl

900afff

ENH: Skiping rows with odf engine

df55b51

ENH: Optimize nrows in read_excel

8177024

Reformatted

79b34c3

Fix linting

f0a2b8d

Add annotation to variable

70ac234

Add imports

27cae3a

Add types

4248f8c

ENH: Fix

70f46b3

ENH: Mark variables as optional

cdfc05d

Merge branch 'master' into excel

502b5e3

ENH: Move nrows variable check

4c8a42a

ENH: Remove unused imports

19bb927

ENH: Move repeated code to base

6c2a3b5

ENH: Remove import

b865c88

ENH: Lint

49276da

ENH: Lint

393a622

ENH: Add docstring to should_read_row

e00fff1

ENH: Lint

b14642b

ENH: Lint

dfc794a

ENH: Move nrows value check

7b501de

ENH: Remove nrows validation

3292f6b

Run tests

bdd5780

ENH: Fix reading rows in openpyxl

1867088

ENH: Fix lint

3c1eb10

Fix max_row variable definition

88c3117

Fix max_row variable definition

dc60055

Add typed in should_read_row function

6fdedfd

Add types and tests

ba7175c

Add whatsnew

d884803

WillAyd reviewed Sep 2, 2020

WillAyd added the IO Excel read_excel, to_excel label Sep 2, 2020

MarcoGorelli added 8 commits September 4, 2020 10:50

Merge remote-tracking branch 'upstream/master' into optimise-nrows-excel

8f6ead4

simplify

98e4093

optimise other readers too

b49e8a9

type

04ed5e8

expand test

0eb3200

move whatsnew note

e88fe1e

skip pyxl

28e51e6

Merge remote-tracking branch 'upstream/master' into optimise-nrows-excel

2d9ee8d

MarcoGorelli marked this pull request as ready for review September 4, 2020 15:41

Merge remote-tracking branch 'upstream/master' into optimise-nrows-excel

bdb5630

jreback added this to the 1.2 milestone Sep 13, 2020

jreback added the Performance Memory or execution speed performance label Sep 13, 2020

Merge branch 'master' into optimise-nrows-excel

b242ca3

WillAyd approved these changes Sep 21, 2020

WillAyd merged commit e975f3d into pandas-dev:master Sep 21, 2020

jbrockmendel reviewed Sep 22, 2020

jbrockmendel added a commit that referenced this pull request Sep 22, 2020

Revert "ENH: Optimize nrows in read_excel (#35974)"

728b377

This reverts commit e975f3d.

jbrockmendel mentioned this pull request Sep 22, 2020

Revert "ENH: Optimize nrows in read_excel" #36537

Merged

jreback pushed a commit that referenced this pull request Sep 22, 2020

Revert "ENH: Optimize nrows in read_excel (#35974)" (#36537)

f3c12fb

This reverts commit e975f3d.

MarcoGorelli deleted the optimise-nrows-excel branch September 22, 2020 17:00

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

ENH: Optimize nrows in read_excel (pandas-dev#35974)

c8b848d

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

Revert "ENH: Optimize nrows in read_excel (pandas-dev#35974)" (pandas…

9f43fb4

…-dev#36537) This reverts commit e975f3d.

lithomas1 mentioned this pull request Feb 2, 2021

ENH: Using nrows option while processing xlsb files #39518

Closed
